Code Author: Lance Wrobel
In-progress individual project on unsupervised learning methods applied to online retail data.
Methods used: Association Rule Mining, Kmeans Clustering, Hierarchical Clustering
The dataset is from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Online+Retail
source("RetailAnalysisFunctions.R") # contains the functions I wrote which are called from this notebook
retail_dataset <- read.csv("OnlineRetail.csv")
retail_dataset <- as_tibble(retail_dataset) # makes it easier to print the dataset and check columns
colnames(retail_dataset)[[1]] <- "InvoiceNumber" # this column name was originally read in wrong
colnames(retail_dataset)[[5]] <- "InvoiceDateAndTime"
The first section of this R Notebook performs an Association Rule Mining analysis.
The following code converts the retail dataset into a transactions object that can be used with the ‘arules’ package. This call uses the entire product description as the item in each transaction.
transactions<-convert_to_transactions(retail_dataset,description_start_end=c(0,0))
itemFrequencyPlot(transactions, topN=10, col=brewer.pal(8,'Pastel1'), type="absolute", main="Item Frequency")
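The internals of convert_to_transactions live in RetailAnalysisFunctions.R and are not shown in this notebook. As a rough sketch of the general idea only, the base R snippet below groups line items into one basket per invoice. The toy data and the names toy_items, baskets, and item_counts are invented for illustration and are not the notebook's actual implementation.

```r
# Toy line items; column names mirror the notebook, values are invented.
toy_items <- data.frame(
  InvoiceNumber = c("A1", "A1", "A2", "A2", "A2", "A3"),
  Description   = c("MUG", "PLATE", "MUG", "MUG", "BOWL", "PLATE"),
  stringsAsFactors = FALSE
)

# One basket per invoice: group descriptions by invoice number and
# drop repeated items within the same invoice.
baskets <- lapply(split(toy_items$Description, toy_items$InvoiceNumber), unique)

# Number of baskets that contain each item (what an absolute
# item frequency plot would display).
item_counts <- sort(table(unlist(baskets)), decreasing = TRUE)
print(baskets)
print(item_counts)
```

A list of character vectors like baskets is one of the inputs that ‘arules’ can coerce with as(baskets, "transactions"), so a helper along these lines may be all that is needed for small datasets.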
rules1 <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.60))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 259
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4223 item(s), 25900 transaction(s)] done [0.45s].
## sorting and recoding items ... [591 item(s)] done [0.02s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.04s].
## writing ... [179 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
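In the output above, supp = 0.01 applied to 25,900 transactions gives the reported absolute minimum support count of 259, and conf = 0.60 is the minimum confidence a rule must reach. To make these measures concrete, here is a minimal base R sketch that computes support, confidence, and lift for one rule by hand. The baskets and the helper support() are invented for illustration; ‘arules’ computes the same quantities far more efficiently.

```r
# Toy baskets (invented); each element is one transaction.
baskets <- list(
  c("MUG", "PLATE"),
  c("MUG", "PLATE", "BOWL"),
  c("MUG"),
  c("PLATE", "BOWL"),
  c("MUG", "PLATE")
)

# Hypothetical helper: fraction of baskets containing every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under inspection: {MUG} => {PLATE}
lhs <- "MUG"
rhs <- "PLATE"

# support    = P(lhs and rhs)
# confidence = P(rhs | lhs) = support(lhs and rhs) / support(lhs)
# lift       = confidence / support(rhs); values above 1 suggest
#              the items co-occur more often than independence predicts
rule_support    <- support(c(lhs, rhs), baskets)
rule_confidence <- rule_support / support(lhs, baskets)
rule_lift       <- rule_confidence / support(rhs, baskets)

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```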
plot(rules1, engine = "htmlwidget")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules1, method = "graph", engine = "htmlwidget")
## Warning: Too many rules supplied. Only plotting the best 100 rules using
## lift (change control parameter max if needed)
transactions_2<-convert_to_transactions(retail_dataset,description_start_end=c(3,6))
rules2 <- apriori(transactions_2, parameter = list(supp = 0.005, conf = 0.70))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 129
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[724 item(s), 25900 transaction(s)] done [0.04s].
## sorting and recoding items ... [280 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
plot(rules2, method = "graph", engine = "htmlwidget")
The next section of this R notebook performs Kmeans and Hierarchical Clustering for customer segmentation.
The first step is to create a customer profile vector for each customer based on their purchasing habits. This code chunk contains only a few basic features as a starting point. The following features are used per customer: total number of transactions, average quantity purchased, average transaction cost, and the number of distinct weeks in which a transaction was placed (as a consistency feature).
retail_dataset_mutated <- retail_dataset %>% mutate(transaction_cost = UnitPrice*Quantity) %>%
separate(InvoiceDateAndTime, c("InvoiceDate","InvoiceTime")," ")
retail_dataset_mutated$InvoiceDate <- as.Date(retail_dataset_mutated$InvoiceDate,"%m/%d/%Y")
retail_dataset_mutated$DateFirstOrder <- with(retail_dataset_mutated, ave(InvoiceDate,CustomerID, FUN = min))
retail_dataset_mutated$DateLastOrder <- with(retail_dataset_mutated, ave(InvoiceDate,CustomerID, FUN = max))
retail_dataset_mutated$week <- week(retail_dataset_mutated$InvoiceDate)
customer_profiles <- retail_dataset_mutated %>% group_by(CustomerID) %>%
summarise(avg_cost_of_transaction = mean(transaction_cost),avg_quantity_bought = mean(Quantity),total_transactions = n(),
distinct_weeks_of_a_transaction = n_distinct(week)) %>% select(-CustomerID)
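One detail worth checking when reading these features: the dataset has one row per line item, so n() counts rows per customer, which can differ from the number of distinct invoices. The base R sketch below illustrates the difference on invented data. The toy values and the names rows_per_customer, invoices_per_customer, and avg_line_cost are assumptions made for illustration, and base R is used here in place of the notebook's dplyr pipeline.

```r
# Toy line items (invented): customer 1 has two invoices, customer 2 has one.
toy <- data.frame(
  CustomerID    = c(1, 1, 1, 2, 2),
  InvoiceNumber = c("A1", "A1", "A2", "B1", "B1"),
  Quantity      = c(2, 1, 4, 10, 5),
  UnitPrice     = c(1.5, 3.0, 2.0, 0.5, 1.0),
  stringsAsFactors = FALSE
)
toy$transaction_cost <- toy$UnitPrice * toy$Quantity

# Rows per customer (what n() counts in a grouped summarise).
rows_per_customer <- tapply(toy$InvoiceNumber, toy$CustomerID, length)

# Distinct invoices per customer (closer to "number of orders").
invoices_per_customer <- tapply(toy$InvoiceNumber, toy$CustomerID,
                                function(x) length(unique(x)))

# Mean cost per line item, per customer.
avg_line_cost <- tapply(toy$transaction_cost, toy$CustomerID, mean)

print(rbind(rows_per_customer, invoices_per_customer, avg_line_cost))
```

Which count is more appropriate depends on whether a "transaction" is meant to be a line item or a whole invoice; in dplyr, the invoice-level count would be n_distinct(InvoiceNumber).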
Next, focus on customers who do not spend very large amounts or buy very large quantities.
customer_profiles_no_outliers<-customer_profiles %>% filter(between(avg_quantity_bought, 0,50), between(avg_cost_of_transaction,0, 60),between(total_transactions, 1,50))
Below I use Kmeans clustering with k=4 based on the customer profiles.
k_means <- kmeans(customer_profiles_no_outliers,4)
plot(k_means,data=customer_profiles_no_outliers)
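Because kmeans starts from random centers, results can change between runs, and features on larger scales can dominate the Euclidean distance. The sketch below shows one common way to handle both, using only base R on invented data. The seed value, the toy profiles, and the choice to standardise with scale() are assumptions for illustration, not steps taken in the notebook above.

```r
set.seed(42)  # arbitrary seed, used only to make this sketch repeatable

# Two well-separated invented groups of 20 customers each.
toy_profiles <- data.frame(
  avg_cost_of_transaction = c(rnorm(20, mean = 10, sd = 1),
                              rnorm(20, mean = 40, sd = 1)),
  total_transactions      = c(rnorm(20, mean = 5, sd = 1),
                              rnorm(20, mean = 30, sd = 1))
)

# Standardise each column to mean 0 and standard deviation 1.
scaled_profiles <- scale(toy_profiles)

# nstart = 25 runs kmeans from 25 random starts and keeps the best fit.
fit <- kmeans(scaled_profiles, centers = 2, nstart = 25)

print(table(fit$cluster))  # cluster sizes
print(fit$centers)         # cluster centers in standardised units
```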
I now do some Hierarchical Clustering using only average cost of transaction and number of total transactions as customer profile features.
library(stats)
# don't use the consistency feature or quantity feature, for visualization purposes
customer_profiles_subset <- customer_profiles_no_outliers %>% select(avg_cost_of_transaction, total_transactions)
clusters<-hclust(dist(customer_profiles_subset))
clusterCut <- cutree(clusters, 3)
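# map the cluster label to colour inside aes() so that scale_color_manual() takes effect
ggplot(customer_profiles_subset, aes(avg_cost_of_transaction, total_transactions, color = factor(clusterCut))) +
geom_point() +
scale_color_manual(values = c('black', 'red', 'green'), name = "cluster")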
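To show what hclust and cutree return, here is a small base R sketch on six invented points. The data and the names toy_points, tree, and groups are made up for illustration. hclust uses complete linkage by default, which is what the call above relies on as well.

```r
# Six invented customers forming three visually obvious groups.
toy_points <- data.frame(
  avg_cost_of_transaction = c(1, 2, 1.5, 20, 21, 40),
  total_transactions      = c(1, 1, 2, 10, 11, 30)
)

# Agglomerative clustering on Euclidean distances (complete linkage by default).
tree <- hclust(dist(toy_points))

# Cut the dendrogram into 3 groups; the result is one integer label per row.
groups <- cutree(tree, k = 3)
print(groups)
```

The labels come back in row order, so they can be attached to the data frame as a column or converted with factor() for plotting.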